Documentation of the data set:

■ attribution of the owner/creator of the data: https://www.kaggle.com/aljarah

■ links to the data: https://www.kaggle.com/aljarah/xAPI-Edu-Data?select=xAPI-Edu-Data.csv

## [1] 480  17

As the output shows, the dataset has 480 rows and 17 columns. Since class is our response variable, we treat the remaining 16 columns as features and try to visualise and understand their effect on class.

1 Gender - student’s gender (nominal: ‘Male’ or ‘Female’)

2 Nationality - student’s nationality (nominal: ‘Kuwait’, ‘Lebanon’, ‘Egypt’, ‘SaudiArabia’, ‘USA’, ‘Jordan’, ‘Venezuela’, ‘Iran’, ‘Tunis’, ‘Morocco’, ‘Syria’, ‘Palestine’, ‘Iraq’, ‘Lybia’)

3 Place of birth - student’s place of birth (nominal: ‘Kuwait’, ‘Lebanon’, ‘Egypt’, ‘SaudiArabia’, ‘USA’, ‘Jordan’, ‘Venezuela’, ‘Iran’, ‘Tunis’, ‘Morocco’, ‘Syria’, ‘Palestine’, ‘Iraq’, ‘Lybia’)

4 Educational Stages - educational level the student belongs to (nominal: ‘lowerlevel’, ‘MiddleSchool’, ‘HighSchool’)

5 Grade Levels - grade the student belongs to (nominal: ‘G-01’, ‘G-02’, ‘G-03’, ‘G-04’, ‘G-05’, ‘G-06’, ‘G-07’, ‘G-08’, ‘G-09’, ‘G-10’, ‘G-11’, ‘G-12’)

6 Section ID - classroom the student belongs to (nominal: ‘A’, ‘B’, ‘C’)

7 Topic - course topic (nominal: ‘English’, ‘Spanish’, ‘French’, ‘Arabic’, ‘IT’, ‘Math’, ‘Chemistry’, ‘Biology’, ‘Science’, ‘History’, ‘Quran’, ‘Geology’)

8 Semester - school year semester (nominal: ‘First’, ‘Second’)

9 Parent responsible for student (nominal: ‘mother’, ‘father’)

10 Raised hand - how many times the student raises his/her hand in the classroom (numeric: 0-100)

11 Visited resources - how many times the student visits course content (numeric: 0-100)

12 Viewing announcements - how many times the student checks new announcements (numeric: 0-100)

13 Discussion groups - how many times the student participates in discussion groups (numeric: 0-100)

14 Parent Answering Survey - whether the parent answered the surveys provided by the school (nominal: ‘Yes’, ‘No’)

15 Parent School Satisfaction - the degree of parent satisfaction with the school (nominal: ‘Good’, ‘Bad’)

16 Student Absence Days - the number of absence days for each student (nominal: ‘Above-7’, ‘Under-7’)

17 Class - students are classified into three intervals based on their total grade/mark: L (Low-Level) covers 0 to 69, M (Middle-Level) covers 70 to 89, and H (High-Level) covers 90 to 100.

Packages loaded for the analysis (startup messages condensed): tidyverse 1.3.1 (ggplot2 3.3.5, tibble 3.1.4, tidyr 1.1.3, readr 2.0.1, purrr 0.3.4, dplyr 1.0.7, stringr 1.4.0, forcats 0.5.1), plyr, corrplot 0.90, gridExtra, ggthemes, caret, MASS, randomForest 4.6-14, party, plotly, rpart and rpart.plot. The startup messages warn that plyr was loaded after dplyr and that several functions are masked (for example count(), select() and filter()).

## 'data.frame':    480 obs. of  17 variables:
##  $ gender     : chr  "M" "M" "M" "M" ...
##  $ nation     : chr  "KW" "KW" "KW" "KW" ...
##  $ birthplace : chr  "KuwaIT" "KuwaIT" "KuwaIT" "KuwaIT" ...
##  $ stageid    : chr  "lowerlevel" "lowerlevel" "lowerlevel" "lowerlevel" ...
##  $ gradeid    : chr  "G-04" "G-04" "G-04" "G-04" ...
##  $ sectionid  : chr  "A" "A" "A" "A" ...
##  $ topic      : chr  "IT" "IT" "IT" "IT" ...
##  $ semester   : chr  "F" "F" "F" "F" ...
##  $ relation   : chr  "Father" "Father" "Father" "Father" ...
##  $ raisedhands: int  15 20 10 30 40 42 35 50 12 70 ...
##  $ n_visit    : int  16 20 7 25 50 30 12 10 21 80 ...
##  $ n_view     : int  2 3 0 5 12 13 0 15 16 25 ...
##  $ discussion : int  20 25 30 35 50 70 17 22 50 70 ...
##  $ p_answer   : chr  "Yes" "Yes" "No" "No" ...
##  $ p_satis    : chr  "Good" "Good" "Bad" "Bad" ...
##  $ n_absent   : chr  "Under-7" "Under-7" "Above-7" "Above-7" ...
##  $ class      : chr  "M" "M" "L" "L" ...

First, let’s take a glimpse at the dataset.

## Rows: 480
## Columns: 17
## $ gender      <chr> "M", "M", "M", "M", "M", "F", "M", "M", "F", "F", "M", "M"~
## $ nation      <chr> "KW", "KW", "KW", "KW", "KW", "KW", "KW", "KW", "KW", "KW"~
## $ birthplace  <chr> "KuwaIT", "KuwaIT", "KuwaIT", "KuwaIT", "KuwaIT", "KuwaIT"~
## $ stageid     <chr> "lowerlevel", "lowerlevel", "lowerlevel", "lowerlevel", "l~
## $ gradeid     <chr> "G-04", "G-04", "G-04", "G-04", "G-04", "G-04", "G-07", "G~
## $ sectionid   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "B"~
## $ topic       <chr> "IT", "IT", "IT", "IT", "IT", "IT", "Math", "Math", "Math"~
## $ semester    <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"~
## $ relation    <chr> "Father", "Father", "Father", "Father", "Father", "Father"~
## $ raisedhands <int> 15, 20, 10, 30, 40, 42, 35, 50, 12, 70, 50, 19, 5, 20, 62,~
## $ n_visit     <int> 16, 20, 7, 25, 50, 30, 12, 10, 21, 80, 88, 6, 1, 14, 70, 4~
## $ n_view      <int> 2, 3, 0, 5, 12, 13, 0, 15, 16, 25, 30, 19, 0, 12, 44, 22, ~
## $ discussion  <int> 20, 25, 30, 35, 50, 70, 17, 22, 50, 70, 80, 12, 11, 19, 60~
## $ p_answer    <chr> "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes",~
## $ p_satis     <chr> "Good", "Good", "Bad", "Bad", "Bad", "Bad", "Bad", "Good",~
## $ n_absent    <chr> "Under-7", "Under-7", "Above-7", "Above-7", "Above-7", "Ab~
## $ class       <chr> "M", "M", "L", "L", "M", "M", "L", "M", "M", "M", "H", "M"~
##      gender      nation  birthplace     stageid     gradeid   sectionid 
##           0           0           0           0           0           0 
##       topic    semester    relation raisedhands     n_visit      n_view 
##           0           0           0           0           0           0 
##  discussion    p_answer     p_satis    n_absent       class 
##           0           0           0           0           0

There are no missing values in the dataset, so no rows or columns need to be removed.
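
The per-column counts above have the shape of a `colSums(is.na(...))` result. The following is a minimal sketch of that check, using only the first six rows of three columns copied from the output in this report; the data frame name `students` is an assumption, and the report does not show the exact call that was used.

```r
# First six rows of three columns, copied from the head() output in this report.
# `students` is a hypothetical name for the data frame.
students <- data.frame(
  gender      = c("M", "M", "M", "M", "M", "F"),
  raisedhands = c(15L, 20L, 10L, 30L, 40L, 42L),
  class       = c("M", "M", "L", "L", "M", "M")
)

# Count missing values per column
na_per_column <- colSums(is.na(students))
print(na_per_column)

# TRUE when no column contains an NA
no_missing <- all(na_per_column == 0)
print(no_missing)
```

Note that this only detects `NA`; empty strings in character columns would not be counted and would need a separate check.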

The variables have been renamed for easier interpretation.
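
A minimal sketch of the renaming step in base R follows. The short names are the ones that appear in the outputs of this report. The original column names are an assumption based on the Kaggle CSV and are not shown anywhere in this report, and the one-row data frame `raw` is a hypothetical stand-in for the file.

```r
# Hypothetical one-row stand-in for the raw CSV. The original column names
# are assumed from the Kaggle file and are not confirmed by this report.
raw <- data.frame(
  gender = "M", NationalITy = "KW", PlaceofBirth = "KuwaIT",
  StageID = "lowerlevel", GradeID = "G-04", SectionID = "A",
  Topic = "IT", Semester = "F", Relation = "Father",
  raisedhands = 15L, VisITedResources = 16L, AnnouncementsView = 2L,
  Discussion = 20L, ParentAnsweringSurvey = "Yes",
  ParentschoolSatisfaction = "Good", StudentAbsenceDays = "Under-7",
  Class = "M"
)

# Short names as they appear in the outputs of this report
names(raw) <- c("gender", "nation", "birthplace", "stageid", "gradeid",
                "sectionid", "topic", "semester", "relation", "raisedhands",
                "n_visit", "n_view", "discussion", "p_answer", "p_satis",
                "n_absent", "class")

# Store the response as a factor with levels in the order L, M, H,
# which is the order shown in the summary output
raw$class <- factor(raw$class, levels = c("L", "M", "H"))
str(raw)
```

Assigning `names()` positionally relies on the column order of the file, so it is worth checking the order before renaming.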

##     gender             nation           birthplace          stageid         
##  Length:480         Length:480         Length:480         Length:480        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    gradeid           sectionid            topic             semester        
##  Length:480         Length:480         Length:480         Length:480        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    relation          raisedhands        n_visit         n_view     
##  Length:480         Min.   :  0.00   Min.   : 0.0   Min.   : 0.00  
##  Class :character   1st Qu.: 15.75   1st Qu.:20.0   1st Qu.:14.00  
##  Mode  :character   Median : 50.00   Median :65.0   Median :33.00  
##                     Mean   : 46.77   Mean   :54.8   Mean   :37.92  
##                     3rd Qu.: 75.00   3rd Qu.:84.0   3rd Qu.:58.00  
##                     Max.   :100.00   Max.   :99.0   Max.   :98.00  
##    discussion      p_answer           p_satis            n_absent        
##  Min.   : 1.00   Length:480         Length:480         Length:480        
##  1st Qu.:20.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :43.28                                                           
##  3rd Qu.:70.00                                                           
##  Max.   :99.00                                                           
##  class  
##  L:127  
##  M:211  
##  H:142  
##         
##         
## 
##   gender nation birthplace    stageid gradeid sectionid topic semester relation
## 1      M     KW     KuwaIT lowerlevel    G-04         A    IT        F   Father
## 2      M     KW     KuwaIT lowerlevel    G-04         A    IT        F   Father
## 3      M     KW     KuwaIT lowerlevel    G-04         A    IT        F   Father
## 4      M     KW     KuwaIT lowerlevel    G-04         A    IT        F   Father
## 5      M     KW     KuwaIT lowerlevel    G-04         A    IT        F   Father
## 6      F     KW     KuwaIT lowerlevel    G-04         A    IT        F   Father
##   raisedhands n_visit n_view discussion p_answer p_satis n_absent class
## 1          15      16      2         20      Yes    Good  Under-7     M
## 2          20      20      3         25      Yes    Good  Under-7     M
## 3          10       7      0         30       No     Bad  Above-7     L
## 4          30      25      5         35       No     Bad  Above-7     L
## 5          40      50     12         50       No     Bad  Above-7     M
## 6          42      30     13         70      Yes     Bad  Above-7     M
##     gender nation birthplace      stageid gradeid sectionid     topic semester
## 475      F Jordan     Jordan MiddleSchool    G-08         A Chemistry        F
## 476      F Jordan     Jordan MiddleSchool    G-08         A Chemistry        S
## 477      F Jordan     Jordan MiddleSchool    G-08         A   Geology        F
## 478      F Jordan     Jordan MiddleSchool    G-08         A   Geology        S
## 479      F Jordan     Jordan MiddleSchool    G-08         A   History        F
## 480      F Jordan     Jordan MiddleSchool    G-08         A   History        S
##     relation raisedhands n_visit n_view discussion p_answer p_satis n_absent
## 475   Father           2       7      4          8       No     Bad  Above-7
## 476   Father           5       4      5          8       No     Bad  Above-7
## 477   Father          50      77     14         28       No     Bad  Under-7
## 478   Father          55      74     25         29       No     Bad  Under-7
## 479   Father          30      17     14         57       No     Bad  Above-7
## 480   Father          35      14     23         62       No     Bad  Above-7
##     class
## 475     L
## 476     L
## 477     M
## 478     M
## 479     L
## 480     L

In this data, female students outperformed male students, as the graph above shows.

The graph reveals that Jordan and Kuwait are over-represented in our sample when compared to other nationalities. Egypt, Iran, Lybia, Morocco, Syria, Tunis, USA and Venezuela have very few observations.

Chemistry has the least diversity of all the topics: most students enrolled in Chemistry are from Jordan. Likewise, most students taking IT are from Kuwait. French, English and Arabic have the most diversity.

Students have performed very well in Biology. We can also note that no Geology student scored below 70, that is, none falls in class L.

Most of the available data is for middle school students, and high school has very few observations. Next, we check whether these levels have a considerable effect on our class variable.

School level does not appear to affect student performance considerably. The majority of students score in the 70-89 range (class M) irrespective of their schooling level.

The majority of students score between 70 and 89 irrespective of the semester. However, the proportion of students scoring above 89 is higher in the second semester.
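
Comparing classes across semesters amounts to comparing row-wise proportions of a semester-by-class table. The sketch below shows the mechanics with `prop.table()`, using only the twelve rows printed by `head()` and `tail()` in this report; it illustrates the computation and does not reproduce the full-data result. The data frame name `shown` is hypothetical.

```r
# The twelve rows printed in this report (rows 1-6 and 475-480)
shown <- data.frame(
  semester = c("F", "F", "F", "F", "F", "F", "F", "S", "F", "S", "F", "S"),
  class    = factor(c("M", "M", "L", "L", "M", "M", "L", "L", "M", "M", "L", "L"),
                    levels = c("L", "M", "H"))
)

# Counts per semester and class, then the share of each class within a semester
counts <- table(semester = shown$semester, class = shown$class)
shares <- prop.table(counts, margin = 1)  # margin = 1: each row sums to 1
print(round(shares, 2))
```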

The variables n_visit and raisedhands show a fairly strong positive correlation. Students who visit the course resources frequently therefore tend to raise their hands in class more often than those who do not.
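
As a minimal sketch of how such a correlation can be computed, the code below applies `cor()` to the first ten values of raisedhands and n_visit shown in the `str()` output of this report. Ten rows are far too few to reproduce the full-data coefficient, so the result is only illustrative.

```r
# First ten values of each variable, copied from the str() output
raisedhands <- c(15, 20, 10, 30, 40, 42, 35, 50, 12, 70)
n_visit     <- c(16, 20, 7, 25, 50, 30, 12, 10, 21, 80)

# Pearson correlation coefficient (the default method of cor())
r <- cor(raisedhands, n_visit)
print(r)

# cor.test() adds a confidence interval and a p-value for the same coefficient
ct <- cor.test(raisedhands, n_visit)
print(ct$conf.int)
```

With the full data frame, `cor()` applied to the four numeric columns would give the matrix that a correlation plot is drawn from.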

Student performance appears to depend on how often students raise their hands, which can serve as a measure of their involvement in class.

Female students raised their hands more often than male students. This is consistent with the idea that the number of hands raised is a potential predictor of academic performance.

Geology appears to be the most engaging subject of all. IT, on the other hand, has very low student participation.

## Confusion Matrix and Statistics
## 
##      Predicted
## Truth   L   M   H
##     L  83  15   1
##     M   9 142  19
##     H   0  26  89
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8177         
##                  95% CI : (0.7754, 0.855)
##     No Information Rate : 0.4766         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7162         
##                                          
##  Mcnemar's Test P-Value : 0.3094         
## 
## Statistics by Class:
## 
##                      Class: L Class: M Class: H
## Sensitivity            0.9022   0.7760   0.8165
## Specificity            0.9452   0.8607   0.9055
## Pos Pred Value         0.8384   0.8353   0.7739
## Neg Pred Value         0.9684   0.8084   0.9257
## Prevalence             0.2396   0.4766   0.2839
## Detection Rate         0.2161   0.3698   0.2318
## Detection Prevalence   0.2578   0.4427   0.2995
## Balanced Accuracy      0.9237   0.8183   0.8610
## Confusion Matrix and Statistics
## 
##      Predicted
## Truth  L  M  H
##     L 19  9  0
##     M  2 31  8
##     H  0  7 20
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7292          
##                  95% CI : (0.6289, 0.8148)
##     No Information Rate : 0.4896          
##     P-Value [Acc > NIR] : 1.542e-06       
##                                           
##                   Kappa : 0.5802          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: L Class: M Class: H
## Sensitivity            0.9048   0.6596   0.7143
## Specificity            0.8800   0.7959   0.8971
## Pos Pred Value         0.6786   0.7561   0.7407
## Neg Pred Value         0.9706   0.7091   0.8841
## Prevalence             0.2188   0.4896   0.2917
## Detection Rate         0.1979   0.3229   0.2083
## Detection Prevalence   0.2917   0.4271   0.2812
## Balanced Accuracy      0.8924   0.7277   0.8057
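
The overall accuracy and the per-class figures can be recomputed directly from the two tables above; the sketch below does so in base R. It assumes the row and column labels are correct (rows are true classes), which the data support: the row totals of the two tables add up to 127, 211 and 142, the class counts in the summary output. Under that reading, per-class recall is the diagonal divided by the row totals. The printed "Sensitivity" values instead equal the diagonal divided by the column totals, which is what caret's `confusionMatrix()` produces when it is given a table with true classes in the rows, since it reads table rows as predictions. The printed Sensitivity and Pos Pred Value therefore appear to be swapped. The function name `metrics` is illustrative.

```r
classes <- c("L", "M", "H")

# Counts copied from the two confusion matrices above (rows = truth)
train_cm <- matrix(c(83,  15,  1,
                      9, 142, 19,
                      0,  26, 89),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Truth = classes, Predicted = classes))
test_cm <- matrix(c(19,  9,  0,
                     2, 31,  8,
                     0,  7, 20),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Truth = classes, Predicted = classes))

# Illustrative helper: accuracy, per-class recall and per-class precision
metrics <- function(cm) {
  list(
    n         = sum(cm),
    accuracy  = sum(diag(cm)) / sum(cm),
    recall    = diag(cm) / rowSums(cm),  # correct / all truly in the class
    precision = diag(cm) / colSums(cm)   # correct / all predicted as the class
  )
}

train_m <- metrics(train_cm)
test_m  <- metrics(test_cm)

# 0.8177 and 0.7292, matching the printed accuracies
print(round(c(train = train_m$accuracy, test = test_m$accuracy), 4))
print(round(train_m$recall, 4))
print(round(test_m$recall, 4))

# Row totals across both tables reproduce the class counts 127, 211, 142
print(rowSums(train_cm) + rowSums(test_cm))
```

Accuracy is unaffected by the orientation of the table, so the reported 0.8177 and 0.7292 hold either way; only the per-class rows change meaning.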

Que. What have I learned and done?

Ans. The results show that accuracy is higher on the training data (0.8177, 384 observations) than on the test data (0.7292, 96 observations). This algorithm can be used to make predictions about students’ learning process and to spot students who are likely to be unsuccessful. From this output, one can also examine students’ capabilities and work with the students who did not perform well in class according to the given data. The data was first pre-processed and explored through different exploratory data analyses. Second, a correlation analysis was used to investigate the relationships between certain characteristics and the class variable. Finally, the data was split and modeled with a decision tree algorithm, which produced the results above.
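
The two confusion matrices total 384 and 96 observations, which is consistent with an 80/20 split of the 480 rows. The sketch below shows one way to draw such a split in base R. The seed, the use of `sample()`, and the data frame name `students` are assumptions; the report does not show how the split was actually made or which tree function was fitted.

```r
set.seed(123)  # assumed seed, used only to make this sketch reproducible

n       <- 480                 # rows in the dataset
n_train <- round(0.8 * n)      # 384 rows for training

# Draw the training row indices at random; the rest form the test set
train_idx <- sample(seq_len(n), size = n_train)
test_idx  <- setdiff(seq_len(n), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))

# With the real data frame (hypothetically named `students`) one could continue:
# train <- students[train_idx, ]
# test  <- students[test_idx, ]
# fit   <- rpart::rpart(class ~ ., data = train, method = "class")
# pred  <- predict(fit, newdata = test, type = "class")
```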

Que. What will I do to improve, and what are my thoughts?

Ans. Some columns will be deleted. Both Grade ID and Stage ID describe the educational stage of students, and Grade ID is divided into 12 categories, which makes it unnecessarily difficult to analyze. Therefore, Grade ID will be deleted. PlaceofBirth will be deleted for a similar reason, since it is similar to Nationality. SectionID, which indicates the classroom a student belongs to, and Semester will also be deleted. I will divide the remaining 13 variables into three categories: 1. Demographic characteristics 2. Academic background characteristics 3. Behavior characteristics. At the same time, I will also analyze the possible internal connections between the columns.